About the project


Hi, I am Wenhsuan!

I hope to learn some useful techniques with R, and to be able to analyze data after the semester ends.

*GitHub repository


Regression and model validation


This is a dataset with 183 observations and 60 variables. By analyzing the data, we hope to understand which variables are related to exam points.

step 1: Data cleaning. To analyze the data, the first step is to clean it: scale the “Attitude” column and select the information we are interested in. Since there are too many variables (183 observations and 60 variables), which would make the analysis hard, I combine related questions into three big categories: deep, surface, and strategic. I then average the values of the deep, surface, and strategic question columns. Finally, I keep only the rows where points is greater than zero. These are the steps of the data cleaning.
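The cleaning steps above can be sketched in R roughly as follows; note that the data URL and the exact question groupings are assumptions based on the course's survey structure and do not appear in this diary.

```r
# Sketch of the data-cleaning step; the URL and question groupings are assumptions.
library(dplyr)

lrn14 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt",
                    sep = "\t", header = TRUE)

# assumed groupings of the survey questions into three big categories
deep_questions      <- c("D03","D11","D19","D27","D07","D14","D22","D30","D06","D15","D23","D31")
surface_questions   <- c("SU02","SU10","SU18","SU26","SU05","SU13","SU21","SU29","SU08","SU16","SU24","SU32")
strategic_questions <- c("ST01","ST09","ST17","ST25","ST04","ST12","ST20","ST28")

lrn14$attitude <- lrn14$Attitude / 10                    # scale the "Attitude" column
lrn14$deep     <- rowMeans(lrn14[, deep_questions])      # average each question group
lrn14$surf     <- rowMeans(lrn14[, surface_questions])
lrn14$stra     <- rowMeans(lrn14[, strategic_questions])

students2014 <- lrn14 %>%
  select(gender, Age, attitude, deep, stra, surf, Points) %>%
  rename(age = Age, points = Points) %>%
  filter(points > 0)                                     # keep rows where points > 0
```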

step 2: Show a graphical overview of the data and show summaries of the variables in the data.


##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

After the data-cleaning step, the data has 166 observations and 7 variables, and I start drawing some plots.

The plots show the distribution of each variable and the relationships between pairs of variables, split by gender. I found a positive correlation between attitude and points (0.43), while deep and surf are negatively correlated. From the box plots, the values of age and the deep questions are relatively concentrated; however, age has many outliers.

The summary above shows the minimum, maximum, mean, and quartile values of each variable (age, attitude, deep, and so on).
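A minimal sketch of how the graphical overview and summary can be produced, assuming the cleaned data frame is called students2014 (the name that appears in the model call later in this section):

```r
# Sketch of the graphical overview; students2014 is the cleaned data frame.
library(GGally)
library(ggplot2)

# pairwise plots: distributions on the diagonal, correlations above,
# scatterplots below, all coloured by gender
ggpairs(students2014,
        mapping = aes(col = gender, alpha = 0.3),
        lower = list(combo = wrap("facethist", bins = 20)))

summary(students2014)  # min, quartiles, mean, max of each variable
```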

step 3: Choose the attitude, deep, and strategic question variables as explanatory variables and fit a regression model where exam points is the target (dependent) variable.

## 
## Call:
## lm(formula = points ~ attitude + deep + stra, data = students2014)
## 
## Coefficients:
## (Intercept)     attitude         deep         stra  
##     11.3915       3.5254      -0.7492       0.9621
## 
## Call:
## lm(formula = points ~ attitude + deep + stra, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3915     3.4077   3.343  0.00103 ** 
## attitude      3.5254     0.5683   6.203 4.44e-09 ***
## deep         -0.7492     0.7507  -0.998  0.31974    
## stra          0.9621     0.5367   1.793  0.07489 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08

I choose points as the Y (dependent) variable and attitude, deep, and stra as X (explanatory) variables to fit a multiple regression. According to the summary, the model's p-value is 2.521e-08, which is smaller than 0.05, so the model as a whole is statistically significant. However, the residual standard error (5.289) is fairly large, so the estimates are not very precise. The multiple R-squared and adjusted R-squared are both low (0.2097 and 0.195), meaning the model explains only about a fifth of the variation in exam points; still, we should not dismiss the model on that basis alone, since R-squared is not the only criterion for judging a regression model.

step 4: Produce Residuals vs Fitted plot, Normal QQ-plot and Residuals vs Leverage plot.

1. Residuals vs Fitted plot: a “good” residuals vs. fitted plot should have no obvious outliers and be roughly symmetrically distributed around the 0 line, without particularly large residuals. In our plot the residuals show no clear pattern against the fitted values, which supports the model's assumptions.

2. Normal Q-Q plot: in theory, if both sets of quantiles come from the same distribution, the points should form a roughly straight line along the 45-degree reference line. In our plot the points fall approximately on the reference line, which means the residuals are close to normally distributed.

3. Residuals vs Leverage plot: this plot helps identify influential data points. Points outside the red dashed Cook's distance lines are influential in the model, and removing them would likely noticeably alter the regression results.
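The three diagnostic plots can be drawn directly from the fitted lm object; the `which` numbers select Residuals vs Fitted, Normal Q-Q, and Residuals vs Leverage. A sketch, assuming the model from step 3:

```r
# Sketch of step 4; my_model is the regression fitted in step 3.
my_model <- lm(points ~ attitude + deep + stra, data = students2014)

# which = 1: Residuals vs Fitted, 2: Normal Q-Q, 5: Residuals vs Leverage
par(mfrow = c(1, 3))
plot(my_model, which = c(1, 2, 5))
```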


Logistic regression

## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

This is a dataset about students' alcohol consumption. There are 382 observations and 35 variables, including the student's sex, age, family size, alcohol consumption, the parents' education and jobs, and so on. Through the analysis, I want to study the relationships between high/low alcohol consumption and some of the other variables in the data.

I assume that “studytime” (weekly study time), “failures” (number of past class failures), “goout” (going out with friends), and “freetime” (free time after school) are variables with a strong relationship to alcohol consumption. My hypothesis is that students who study less per week, have failed more classes, go out with friends more often, and have more free time after school consume more alcohol.

1. Numerically and graphically explore the distributions

## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use count mean_study_time
##   <fct> <lgl>    <int>           <dbl>
## 1 F     FALSE      157            2.34
## 2 F     TRUE        41            2   
## 3 M     FALSE      113            1.88
## 4 M     TRUE        71            1.62
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use count mean_failures
##   <fct> <lgl>    <int>         <dbl>
## 1 F     FALSE      157         0.204
## 2 F     TRUE        41         0.439
## 3 M     FALSE      113         0.239
## 4 M     TRUE        71         0.479

According to the summary statistics of study time grouped by sex and high_use, females with high alcohol use have a shorter mean study time than females with low use, and those who study longer tend not to be high users. The same pattern holds for males. The result matches my hypothesis.

The summary statistics of failures grouped by sex and high_use show that females who failed more classes in the past tend to be high users of alcohol, and those with fewer past failures tend not to be. The same holds for males. This also matches my hypothesis.

The boxplots show that for the “goout” variable, females who consume more alcohol go out more, and the same holds for males, which matches my hypothesis. For “freetime”, females who consume more alcohol have more free time after school; however, the effect is much weaker for males. So this result is similar to my hypothesis but does not match it exactly.
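The summary tables and boxplots above can be sketched roughly as follows, assuming the data frame is called alc (the name used in the glm call later):

```r
# Sketch of the numerical and graphical exploration; alc is the joined data.
library(dplyr)
library(ggplot2)

# mean study time by sex and high_use (first summary table above)
alc %>%
  group_by(sex, high_use) %>%
  summarise(count = n(), mean_study_time = mean(studytime))

# boxplot of 'goout' by high_use, separately for each sex
ggplot(alc, aes(x = high_use, y = goout, col = sex)) +
  geom_boxplot() +
  ylab("going out with friends")
```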

2. Use logistic regression to statistically explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable.

I choose “studytime”, “failures”, “goout”, and “freetime” as the four X variables and fit a model where Y is “high_use” (high/low alcohol consumption). I also looked at the data separately for males and females to dig deeper. Below is what I found from the model.

## 
## Call:
## glm(formula = high_use ~ studytime + failures + goout + freetime, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8214  -0.7528  -0.5442   0.8552   2.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.36957    0.62399  -3.797 0.000146 ***
## studytime   -0.57481    0.16784  -3.425 0.000615 ***
## failures     0.19303    0.16899   1.142 0.253334    
## goout        0.70490    0.12039   5.855 4.77e-09 ***
## freetime     0.07209    0.13531   0.533 0.594163    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 395.17  on 377  degrees of freedom
## AIC: 405.17
## 
## Number of Fisher Scoring iterations: 4
## (Intercept)   studytime    failures       goout    freetime 
## -2.36956938 -0.57481413  0.19303395  0.70489610  0.07209276
##                     OR     2.5 %    97.5 %
## (Intercept) 0.09352099 0.0267779 0.3109179
## studytime   0.56280947 0.4007339 0.7752293
## failures    1.21292398 0.8699068 1.6929038
## goout       2.02363642 1.6081702 2.5811264
## freetime    1.07475503 0.8240334 1.4026604

The summary shows that the standard errors for “studytime”, “failures”, “goout”, and “freetime” are 0.168, 0.169, 0.120, and 0.135, which are comparatively small, so the coefficient estimates are fairly precise. The corresponding coefficients are -0.575, 0.193, 0.705, and 0.072. Judging by the p-values, “studytime” (negative coefficient) and “goout” (positive coefficient) are statistically significantly related to Y (“high_use”), while “failures” and “freetime” are not.

The odds ratio of “studytime” is 0.563 (less than 1), of “failures” 1.213, of “goout” 2.024, and of “freetime” 1.075 (all greater than 1). This means that “failures”, “goout”, and “freetime” are positively associated with “high_use”, while “studytime” is negatively associated: more weekly study time lowers the odds of high consumption. According to the confidence intervals, the intervals for “studytime” and “goout” exclude 1, so those associations are statistically significant, whereas the intervals for “failures” and “freetime” include 1.

From these results, “goout” is the variable most clearly associated with high/low alcohol consumption, with “failures” and “freetime” showing weaker positive associations. Since the odds ratio of “studytime” is below 1, its association runs in the opposite direction, and I drop it before building the prediction model in the next step.

3. Use the variables that have a statistical relationship with high/low alcohol consumption to explore the predictive power of the model.

##         prediction
## high_use FALSE TRUE
##    FALSE   248   22
##    TRUE     76   36

Having dropped “studytime”, I refit the model and use it for prediction. From the confusion matrix, taking low use (FALSE) as the positive class, the precision is 248/(248+76) = 0.77 and the recall is 248/(248+22) = 0.92. The model therefore has high recall but lower precision: most low-use students are correctly recognized, but many high-use students (76 of 112) are misclassified as low use. The average proportion of wrong predictions on the training data is 0.2565, and in cross-validation 0.2487, so the model's error rate is fairly high (around 25%).
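The confusion matrix, training error, and cross-validation error can be computed roughly as follows; the 0.5 probability cut-off and the choice of 10 folds are assumptions:

```r
# Sketch of the prediction step; studytime is dropped as described above.
library(boot)

m <- glm(high_use ~ failures + goout + freetime, data = alc, family = "binomial")

alc$probability <- predict(m, type = "response")  # predicted probabilities
alc$prediction  <- alc$probability > 0.5          # classify with a 0.5 cut-off
table(high_use = alc$high_use, prediction = alc$prediction)  # confusion matrix

# loss function: mean proportion of wrong predictions
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)
loss_func(class = alc$high_use, prob = alc$probability)  # training error

# 10-fold cross-validation error
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
```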


Clustering and classification

1. Explore the structure and the dimensions of the data

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## [1] 506  14

The Boston data (housing values in suburbs of Boston) has 506 observations and 14 variables, including “crim” (per capita crime rate by town), “zn” (proportion of residential land zoned for lots over 25,000 sq.ft.), and “rm” (average number of rooms per dwelling). By analyzing the data, we hope to understand which features affect housing prices.

2. Show a graphical overview of the data and show summaries of the variables in the data.

##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The correlation matrix plot shows the relationship between each pair of variables. A blue dot means the two variables are positively correlated; a red dot means they are negatively correlated. The darker the color (blue or red), the stronger the correlation; the dot is white or almost white when the correlation is weak or absent. For example, “rad” and “tax” have a high positive correlation, while “lstat” and “medv”, and “age” and “dis”, are negatively correlated. From the summary, I see the minimum, maximum, mean, and quartile values of each variable. For example, the average number of rooms per dwelling is 6.285, and the average crime rate is 3.613.
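The correlation plot described above can be drawn with the corrplot package; a minimal sketch:

```r
# Sketch of the correlation plot; Boston comes from the MASS package.
library(MASS)
library(corrplot)

cor_matrix <- round(cor(Boston), digits = 2)     # the matrix printed above
corrplot(cor_matrix, method = "circle", type = "upper")
```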

3. Standardize the dataset and print out summaries of the scaled data.

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

After scaling, the mean of each variable is 0 and each variable is measured in standard-deviation units, so the variables are on comparable scales.

4. Create a categorical variable of the crime rate in the Boston dataset, and drop the old crime rate variable from the dataset.

## crime
##      low  med_low med_high     high 
##      127      126      126      127
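Creating the crime categories can be sketched as follows; the data frame name boston_scaled and the use of quartile break points are assumptions consistent with the equal-sized categories shown above:

```r
# Sketch of step 4; boston_scaled holds the standardized Boston data (step 3).
library(MASS)
boston_scaled <- as.data.frame(scale(Boston))

bins <- quantile(boston_scaled$crim)             # quartile break points
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)                                     # roughly equal-sized categories

# drop the old crime rate variable and add the categorical one
boston_scaled <- data.frame(dplyr::select(boston_scaled, -crim), crime)
```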

5. Divide the dataset into train and test sets so that 80% of the data belongs to the train set, and fit linear discriminant analysis on the train set.

## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2549505 0.2500000 0.2400990 
## 
## Group means:
##                  zn      indus         chas        nox         rm
## low       0.9172982 -0.8791733 -0.081207697 -0.8600290  0.5015735
## med_low  -0.1248659 -0.3220393 -0.004759149 -0.5623348 -0.1286361
## med_high -0.3581843  0.1684602  0.195445218  0.4343774  0.1316144
## high     -0.4872402  1.0149946  0.011791568  1.0632950 -0.4575021
##                 age        dis        rad        tax     ptratio
## low      -0.8813803  0.8301270 -0.7053471 -0.7605381 -0.48217523
## med_low  -0.3081105  0.3387474 -0.5458997 -0.4705499 -0.05076457
## med_high  0.4237496 -0.3794713 -0.4155974 -0.3161717 -0.40340755
## high      0.8210526 -0.8680522  1.6596029  1.5294129  0.80577843
##                black       lstat        medv
## low       0.37215747 -0.77975886  0.60342015
## med_low   0.32464046 -0.12448928  0.00625033
## med_high  0.09255862  0.02319678  0.22971777
## high     -0.70475064  0.83894493 -0.65191096
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.138147603  0.60596054 -1.00809030
## indus   -0.015403765 -0.23770783  0.05753724
## chas    -0.051622705  0.01937468  0.13664845
## nox      0.241882493 -0.98643384 -1.26407958
## rm      -0.109956056 -0.09876600 -0.16149914
## age      0.356291976 -0.34186011 -0.06535453
## dis     -0.109935819 -0.34586588  0.09341477
## rad      3.478763696  0.87430205 -0.30998827
## tax     -0.006822419  0.10700558  0.77730241
## ptratio  0.121113638  0.03940028 -0.21760293
## black   -0.160509219  0.03305790  0.13390020
## lstat    0.145193143 -0.22247044  0.40482985
## medv     0.139507560 -0.41137725 -0.12478686
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9523 0.0351 0.0127

6. Save the crime categories from the test set, remove the categorical crime variable from the test dataset, and then predict the classes with the LDA model on the test data.

##           predicted
## correct    low med_low med_high high
##   low       16       8        0    0
##   med_low    4      15        4    0
##   med_high   0      12       12    1
##   high       0       0        1   29

The cross tabulation shows the relation between the predictions and the correct classes. For example, 16 observations are predicted low when the correct class is low, while 8 low observations are misclassified as med_low. Most errors occur between adjacent categories, and the high class is predicted best (29 of 30 correct).

7. Reload the Boston dataset, standardize it, and calculate the distances between the observations.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

I calculate the distances with both the Euclidean and the Manhattan distance measures. The Manhattan distances are consistently larger than the Euclidean ones (e.g. the median Euclidean distance is 4.8241, while the median Manhattan distance is 12.6090).
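A sketch of the distance computations:

```r
# Sketch of step 7: reload, standardize, and compute pairwise distances.
library(MASS)
data("Boston")
boston_scaled <- scale(Boston)

dist_eu  <- dist(boston_scaled, method = "euclidean")
dist_man <- dist(boston_scaled, method = "manhattan")
summary(dist_eu)
summary(dist_man)
```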

8. Run the k-means algorithm on the dataset. Investigate the optimal number of clusters and run the algorithm again. Visualize the clusters and interpret the results.

The visualization shows how the clusters are distributed across the different variables. Since we set 3 clusters, there are three colors in each plot, one per cluster. The line chart shows that the total within-cluster sum of squares is highest with one cluster and drops quickly as clusters are added; the optimal number of clusters is where the drop levels off.
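The k-means step can be sketched as follows; the seed, the range of k values, and the subset of columns in the pairs plot are assumptions:

```r
# Sketch of the k-means step on the standardized Boston data.
library(MASS)
boston_scaled <- scale(Boston)

set.seed(123)  # k-means starts from random centers; fix the seed

# total within-cluster sum of squares (WCSS) for k = 1..10
twcss <- sapply(1:10, function(k) kmeans(boston_scaled, centers = k)$tot.withinss)
plot(1:10, twcss, type = "b")  # choose k where the curve's drop levels off

km <- kmeans(boston_scaled, centers = 3)
pairs(boston_scaled[, 1:5], col = km$cluster)  # clusters across variable pairs
```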


Dimensionality Reduction Techniques

1. Show a graphical overview of the data and show summaries of the variables in the data.

##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

The human data includes 155 observations and 8 variables: “Edu2.FM”, “Labo.FM”, “Edu.Exp”, “Life.Exp”, “GNI”, “Mat.Mor”, “Ado.Birth”, and “Parli.F”.

Ggpairs shows the correlations between pairs of variables. I find that “Ado.Birth” and “Edu.Exp”, “Ado.Birth” and “Life.Exp”, “Mat.Mor” and “Edu.Exp”, and “Mat.Mor” and “Life.Exp” have high negative correlations, while “Life.Exp” and “Edu.Exp”, and “Ado.Birth” and “Mat.Mor”, have high positive correlations.

Corrplot gives a more compact visualization than ggpairs: it shows the correlation between each pair of variables with color. The redder the dot, the more negatively correlated the two variables; the bluer, the more positively correlated. However, corrplot only shows the general strength of the relationship, not the exact correlation values.

2. Perform principal component analysis (PCA) on the non-standardized human data.

## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000

Since the variables are not standardized, their standard deviations differ by orders of magnitude, and the first principal component captures practically all of the variance (99.99%), driven by the variable with the largest scale. In the biplot, most of the countries cluster together, and the variable arrows are squashed together as well.

3. Standardize the variables in the human data and repeat the above analysis. Are the results different? Why or why not?

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000

The results before and after standardization are clearly different. Before standardization, the countries and the variable arrows all bunch together, which makes the biplot hard to interpret; after standardization, the countries are more evenly distributed and the variables have more similar standard deviations (the arrows are of roughly equal length). This is because PCA is sensitive to the scale of the variables, and standardization puts them all on the same scale.
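The standardized PCA can be sketched as follows; the data frame name human and the plotting details are assumptions:

```r
# Sketch of the PCA on the standardized data; 'human' is the data frame
# with the eight variables described above (name assumed).
human_std <- scale(human)
pca_human <- prcomp(human_std)
summary(pca_human)  # proportion of variance of each component

# biplot of the first two components: countries as points, variables as arrows
biplot(pca_human, choices = 1:2, cex = c(0.5, 0.8), col = c("grey40", "deeppink2"))
```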

4. Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data.

After standardization, the countries are distributed more evenly. The arrows show the connections between the original features and the first two principal components (PC1, PC2), and the countries are placed at the x and y coordinates defined by those two PCs. The angle between arrows represents the correlation between features: a small angle means a high positive correlation. We can see that, apart from “Parli.F” and “Labo.FM”, which point along PC2, the other variables are mainly aligned with PC1.

The lengths of the arrows are proportional to the standard deviations of the features; from the plot, we can see that the variables have similar standard deviations.

5. Look at the structure and the dimensions of the tea data and visualize it. Interpret the results of the MCA and draw at least the variable biplot of the analysis.

The tea dataset includes 300 observations and 6 variables, which are:

- “Tea”: factor with 3 levels “black”, “Earl Grey”, “green”
- “How”: factor with 4 levels “alone”, “lemon”, “milk”, “other”
- “how”: factor with 3 levels “tea bag”, “tea bag+unpackaged”, “unpackaged”
- “sugar”: factor with 2 levels “No.sugar”, “sugar”
- “where”: factor with 3 levels “chain store”, “chain store+tea shop”, “tea shop”
- “lunch”: factor with 2 levels “lunch”, “Not.lunch”

The summary shows the counts of each level:

- Tea: black 74, Earl Grey 193, green 33
- How: alone 195, lemon 33, milk 63, other 9
- how: tea bag 170, tea bag+unpackaged 94, unpackaged 36
- sugar: No.sugar 155, sugar 145
- where: chain store 192, chain store+tea shop 78, tea shop 30
- lunch: lunch 44, Not.lunch 256

In other words, the most common level of Tea is “Earl Grey” (193); of How, “alone” (195); of how, “tea bag” (170); of sugar, “No.sugar” (155); of where, “chain store” (192); and of lunch, “Not.lunch” (256).

The visualization of the dataset makes the summary easier to interpret. Next, I run multiple correspondence analysis (MCA). Its summary shows the eigenvalues, the individuals, the categories, and the categorical variables. From the eigenvalues we see that Dim.1 and Dim.2 retain a larger percentage of the variance than the other dimensions. From the v.test values of the categories, the coordinates of “black”, “Earl Grey”, “green”, “lemon”, “milk”, “tea bag”, “tea bag+unpackaged”, and “unpackaged” are significantly different from zero (|v.test| > 1.96). From the categorical variables we see that “how” and “where” have the strongest correlation with Dim.1.

The MCA biplot shows the possible variable patterns. The distance between variable categories reflects their similarity; for example, “lemon” and “alone” are more similar than “lemon” and “other”.
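The MCA and its variable biplot can be sketched as follows; the data frame name tea (assumed to already contain only the six columns listed above) is an assumption:

```r
# Sketch of the MCA on the six selected tea variables.
library(FactoMineR)

mca <- MCA(tea, graph = FALSE)
summary(mca)  # eigenvalues, individuals, categories, categorical variables

# variable biplot: individuals hidden, categories coloured by variable
plot(mca, invisible = c("ind"), habillage = "quali")
```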